SemGen - Towards a Semantic Data Generator for Benchmarking Duplicate Detectors
Abstract
Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of duplicate pairs, in addition to test data sets of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and variability cannot be configured sufficiently. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level; in a second step, quantitative values are generated. To demonstrate the applicability of SemGen, we outline how to define duplicate semantics for the domain of road traffic management. A discussion of lessons learned concludes the paper.
Similar resources
Generating Synthetic RDF Data with Connected Blank Nodes for Benchmarking
Generators for synthetic RDF datasets are very important for testing and benchmarking various semantic data management tasks (e.g. querying, storage, update, comparison, integration). However, current generators do not sufficiently support (or totally ignore) blank node connectivity issues. Blank nodes are used for various purposes (e.g. for describing complex attributes), and a significant perc...
Pushing the Limits of Instance Matching Systems: A Semantics-Aware Benchmark for Linked Data
The architectural choices behind the Data Web have led to the publication of large interrelated data sets that contain different descriptions for the same real-world objects. Due to the mere size of current online datasets, such duplicate instances are most commonly detected (semi-)automatically using instance matching frameworks. Choosing the right framework for this purpose remains tedious, a...
Clone Detection by Comparing Abstract Memory States
In this paper, we propose a new semantic clone detection technique by comparing programs’ abstract memory states, which are computed by a semantic-based static analyzer. Our experimental study using three large-scale open source projects shows that our technique can detect semantic clones that existing syntactic or semantic-based clone detectors miss. Our technique can help developers identify i...
Benchmarking RDF Query Engines: The LDBC Semantic Publishing Benchmark
The Linked Data paradigm, which is now the prominent enabler for sharing huge volumes of data by means of Semantic Web technologies, has created novel challenges for non-relational data management technologies such as RDF and graph database systems. Benchmarking, which is an important factor in the development of research on RDF and graph data management technologies, must address these challeng...
Towards constructing an Integrative, Multi-Level Model for Cognition: The Function of Semantic Networks
Integrated approaches try to connect different constructs in different theories and reinterpret them using a common conceptual framework. In this research, using the concept of processing levels, an integrated, three-level model of the cognitive systems has been proposed and evaluated. Processing levels are divided into three categories of Feature-Oriented, Semantic and Conceptual Level based o...